Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HealthCheck log output options #23900

Merged
merged 1 commit into from
Sep 26, 2024

Conversation

Honny1
Copy link
Member

@Honny1 Honny1 commented Sep 9, 2024

This PR creates three new flags that can affect the output of the HealtCheck log.

Currently, when a container is configured with HealthCheck, the output from the HealthCheck command is only logged to the container status file, which is accessible via podman inspect. It is also limited to the last five executions and the first 500 characters per execution.

This makes debugging past problems very difficult, since the only information available about the failure of the HealthCheck command is the generic healthcheck service failed record.

  • The --health-log-destination flag sets the destination of the HealthCheck log.

    • none: (default behavior) HealthCheckResults are stored in overlay containers. (For example: ./run/containers/storage/overlay-containers/<container-ID>/healthcheck.log)
    • directory: creates a log file named <container-ID>-healthcheck.log with JSON HealthCheckResults in the specified directory.
    • events_logger: The log will be written with logging mechanism set by events_logger.
  • The --health-max-log-count flag sets the maximum number of attempts in the HealthCheck log file.

    • A value of 0 indicates an infinite number of attempts in the log file.
    • The default value is 5 attempts in the log file.
  • The --health-max-log-size flag sets the maximum length of the log stored.

    • A value of 0 indicates an infinite log length.
    • The default value is 500 log characters.

Does this PR introduce a user-facing change?

Added --health-log-destination, --health-max-log-count and --health-max-log-size flags that affect HealtCheck log output.

Fixes: RHEL-24623

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Enforce release-note requirement, even if just None labels Sep 9, 2024
@Honny1 Honny1 force-pushed the healthcheck-log branch 11 times, most recently from 331ce42 to ebadb8b Compare September 10, 2024 15:58
@containers containers deleted a comment from packit-as-a-service bot Sep 10, 2024
@Honny1 Honny1 force-pushed the healthcheck-log branch 16 times, most recently from bf7e1cb to 11b35a5 Compare September 13, 2024 12:10
@Honny1
Copy link
Member Author

Honny1 commented Sep 24, 2024

@mheon @Luap99 PTAL, I have resolved the feedback and checked that the defaults are propagated correctly.

if ee.HealthLog != "" {
m["PODMAN_HEALTH_LOG"] = ee.HealthLog
}
m["PODMAN_HEALTH_FAILING_STREAK"] = strconv.Itoa(ee.HealthFailingStreak)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should only write this when we actually write a hc event.
Right now all events would get PODMAN_HEALTH_FAILING_STREAK set to 0 which doesn't make sense, I guess the same issue applies to PODMAN_HEALTH_STATUS

Comment on lines 64 to 71
// HealthLogDestination defines the destination where the log is stored
HealthLogDestination string `json:"HealthLogDestination,omitempty" default:"local"`
// HealthMaxLogCount is maximum number of attempts in the HealthCheck log file.
// ('0' value means an infinite number of attempts in the log file)
HealthMaxLogCount uint `json:"HealthcheckMaxLogCount,omitempty" default:"5"`
// HealthMaxLogSize is the maximum length in characters of stored HealthCheck log
// ("0" value means an infinite log length)
HealthMaxLogSize uint `json:"HealthcheckMaxLogSize,omitempty" default:"500"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What uses the default key? AFAICT we never used this syntax anywhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it would force a default value. For example, instead of 0 to 5. But it's not. I'll remove it.

Comment on lines 416 to 423
// HealthLogDestination defines the destination where the log is stored
HealthLogDestination string `json:"healthLogDestination,omitempty" default:"local"`
// HealthMaxLogCount is maximum number of attempts in the HealthCheck log file.
// ('0' value means an infinite number of attempts in the log file)
HealthMaxLogCount uint `json:"healthMaxLogCount,omitempty" default:"5"`
// HealthMaxLogSize is the maximum length in characters of stored HealthCheck log
// ("0" value means an infinite log length)
HealthMaxLogSize uint `json:"healthMaxLogSize,omitempty" default:"500"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here default key, this is a duplication if we ever change the default

Comment on lines 649 to 648
} else {
options = append(options, libpod.WithHealthCheckLogDestination(define.DefaultHealthCheckLocalDestination))
options = append(options, libpod.WithHealthCheckMaxLogCount(define.DefaultHealthMaxLogCount))
options = append(options, libpod.WithHealthCheckMaxLogSize(define.DefaultHealthMaxLogSize))
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit confusing, I think it is better when defaults are set in libpod on not by specgen.

Comment on lines 602 to 609
// HealthLogDestination defines the destination where the log is stored
HealthLogDestination string `json:"healthLogDestination,omitempty" default:"local"`
// HealthMaxLogCount is maximum number of attempts in the HealthCheck log file.
// ('0' value means an infinite number of attempts in the log file)
HealthMaxLogCount uint `json:"healthMaxLogCount,omitempty" default:"5"`
// HealthMaxLogSize is the maximum length in characters of stored HealthCheck log
// ("0" value means an infinite log length)
HealthMaxLogSize uint `json:"healthMaxLogSize,omitempty" default:"500"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here with the default duplciation.

Comment on lines 296 to 297
count=$(grep -o "$msg" <<< "$output" | wc -l)
assert "$count" -eq 5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this produces nice errors either but I leave to @edsantiago to judge.

@@ -273,4 +273,215 @@ Log[-1].Output | \"Uh-oh on stdout!\\\nUh-oh on stderr!\\\n\"
done
}

@test "podman healthcheck --health-max-log-count default value (5)" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seem there is rather significant duplication in what is being tested in e2e and system tests. We should not have duplicate tests as this just wastes CI time. One of them is good enough. I think using system tests make sense here as they run in gating.

And it is extremely difficult for me to figure out if these tests are indeed the same or if there are some minor differences.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests test the same functionality. I will remove the e2e tests.

is "$output" "" "output from 'podman healthcheck run'"
done

# sys podman tests executes healthcheck twice the first time healthcheck is run
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not correct, you are running them twice... The healthcheck is trigger by podman when the container is started but because you manually run_podman healthcheck run you run it twice not podman.

Copy link
Member

@edsantiago edsantiago left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Way too much duplication in the tests. Can you find a way to refactor? Also a few other suggestions for corrections.

I'll take another look once you deduplicate.

@@ -273,4 +273,215 @@ Log[-1].Output | \"Uh-oh on stdout!\\\nUh-oh on stderr!\\\n\"
done
}

@test "podman healthcheck --health-max-log-count default value (5)" {
local repeat_count=10
local msg="Hello, How are you?"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's good practice to use randomized strings:

    local msg="healthmsg-$(random_string)"

This prevents false negatives from caching, or copypasta, or various other types of logic errors. The "healthmsg" prefix will make it very easy for a future maintainer, looking at test failures, to see what type of string this is and to grep the code for it.

cid="$output"

run_podman inspect $ctrname --format "{{.Config.HealthMaxLogCount}}"
is "$output" "5" "HealthMaxLogCount is set to 5"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please drop the "is set to 5": on failure, the diagnostic message will already say "expected 5, got (something else)". And if the expected value ever changes, someone might change the "5" but forget to change the "is set to 5", and the code will be impossible to understand.

Also, where does this magic 5 come from? Is it a hardcoded default in podman? Is it a containers.conf setting? (I could read the PR but this is way too huge right now). What will happen if/when that default changes or is overridden? Maybe the message should say "HealthMaxLogCount is the expected default", or some explanation of where it's coming from

for _ in $(seq 1 $repeat_count);
do
run_podman healthcheck run $ctrname
is "$output" "" "output from 'podman healthcheck run'"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: "unexpected output from podman healthcheck run (pass $i)" (and use i instead of _). Any time there's a loop, it's nice to include the variables in the error message. Not super helpful in this particular case, but it's a good habit to get into.



run_podman inspect $ctrname --format "{{.State.Health.Log}}"
count=$(grep -o "$msg" <<< "$output" | wc -l)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not grep -c? As in, count=$(grep -co "$msg" <<<"$output")


run_podman inspect $ctrname --format "{{.State.Health.Log}}"
count=$(grep -o "$msg" <<< "$output" | wc -l)
assert "$count" -eq 5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please explain the assertion, e.g. assert "$count" -eq 5 "Number of matching health log lines"

@Honny1 Honny1 force-pushed the healthcheck-log branch 3 times, most recently from 909bc6d to 49f036d Compare September 24, 2024 19:03
@Honny1
Copy link
Member Author

Honny1 commented Sep 24, 2024

@edsantiago @Luap99 PTAL, I've modified the code according to your feedback.

@edsantiago
Copy link
Member

Looking, but, test flakes rootless on my laptop:

✗ |220| podman healthcheck --health-max-log-size infinite value (0) [3485]
...
...very very very very long string
#/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
#|     FAIL: Number of matching health log messages
#| expected: -eq 2
#|   actual:     1
#\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@Honny1
Copy link
Member Author

Honny1 commented Sep 24, 2024

This is caused by the fact that podman run creates and starts the systemd timer right away, which starts the first run of HealtCheck when the container is created. Then the podman healthcheck run is manually triggered again in the test. That's why you can get a second run. However, this depends on systemd. I can avoid this by using the -ge comparison. WDYT? @edsantiago

Copy link
Member

@edsantiago edsantiago left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice refactoring. A few suggestions and requests inline. As for the flake, I have no opinion yet, don't understand the code well enough to comment.

One more suggestion: could you please add comments in important places? This all makes perfect sense to you, because you're deeply immersed in the code today, but it is very hard for me to read, and one day will be very hard for you too. Simple comments (some suggestions below) can save a maintainer precious time.

Thank you!

local ctrname="c-h-$(safename)"
_create_container_with_health_log_settings $ctrname $msg "{{.Config.HealthMaxLogCount}}" "--health-max-log-count $repeat_count" "$repeat_count" "HealthMaxLogCount"

# This is run 11 times to check that the cap is working.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: is run one more time than max-count. And instead of 0 $repeat_count, which is too subtle for a quick glance, $(seq 1 $repeat_count) plusone

run_podman healthcheck run $ctrname
is "$output" "" "output from 'podman healthcheck run'"

_check_health_log $ctrname "healthmsg-}]$" 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: substr=${msg:0:10} then _check ... "$substr}]\$". With an explanation that max-log-size truncates the echo message, because there are just way too many options and way too many magic "10"s all over, and it's confusing...



@test "podman healthcheck --health-log-destination file" {
local TMP_DIR_HEALTHCHECK="$BATS_FILE_TMPDIR/healthcheck"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use $PODMAN_TMPDIR. That's unique to each test.

Comment on lines 423 to 424
run cat $healthcheck_log_path
count=$(grep -co "$msg" <<< "$output")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: the run cat is unnecessary, you can just count=$(grep -co "$msg" $healthcheck_log_path)


# The healthcheck is triggered by the podman when the container is started, but its execution depends on systemd.
# And since `run_podman healthcheck run` is also run manually, it will result in two runs.
run journalctl --output cat --output-fields=PODMAN_HEALTH_LOG PODMAN_ID=$cid
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bats run is evil. Or at least fiendishly tricky. It hides a lot of important stuff. Suggestion:

cmd="journalctl --output cat ...etc"
echo "$_LOG_PROMPT $cmd"
run $cmd
echo "$output"
assert "$status" -eq 0 "exit status of journalctl"

This is tedious and it's very unfortunate that we have to do things like this... but a future maintainer will greatly appreciate it when the test fails.

}

@test "podman healthcheck --health-max-log-count default value (5)" {
local repeat_count=10
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: eliminate

local ctrname="c-h-$(safename)"
_create_container_with_health_log_settings $ctrname $msg "{{.Config.HealthMaxLogCount}}" "" "5" "HealthMaxLogCount is the expected default"

for i in $(seq 1 $repeat_count);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just hardcode 10, and add a comment above, something like "run many more times than max count; confirm that excess ones are not logged"

…nation flags

These flags can affect the output of the HealtCheck log. Currently, when a container is configured with HealthCheck, the output from the HealthCheck command is only logged to the container status file, which is accessible via `podman inspect`.
It is also limited to the last five executions and the first 500 characters per execution.

This makes debugging past problems very difficult, since the only information available about the failure of the HealthCheck command is the generic `healthcheck service failed` record.

- The `--health-log-destination` flag sets the destination of the HealthCheck log.
  - `none`: (default behavior) `HealthCheckResults` are stored in overlay containers. (For example: `$runroot/healthcheck.log`)
  - `directory`: creates a log file named `<container-ID>-healthcheck.log` with JSON `HealthCheckResults` in the specified directory.
  - `events_logger`: The log will be written with logging mechanism set by events_loggeri. It also saves the log to a default directory, for performance on a system with a large number of logs.

- The `--health-max-log-count` flag sets the maximum number of attempts in the HealthCheck log file.
  - A value of `0` indicates an infinite number of attempts in the log file.
  - The default value is `5` attempts in the log file.
- The `--health-max-log-size` flag sets the maximum length of the log stored.
  - A value of `0` indicates an infinite log length.
  - The default value is `500` log characters.

Add --health-max-log-count flag

Signed-off-by: Jan Rodák <[email protected]>

Add --health-max-log-size flag

Signed-off-by: Jan Rodák <[email protected]>

Add --health-log-destination flag

Signed-off-by: Jan Rodák <[email protected]>
@Honny1
Copy link
Member Author

Honny1 commented Sep 25, 2024

@edsantiago PTAL, I've modified the tests according to your suggestions.

Copy link

Ephemeral COPR build failed. @containers/packit-build please check.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 25, 2024
@containers containers deleted a comment from packit-as-a-service bot Sep 25, 2024
@Honny1 Honny1 requested a review from Luap99 September 25, 2024 15:28
Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 26, 2024
Copy link
Contributor

openshift-ci bot commented Sep 26, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago, Honny1, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 4e38381 into containers:main Sep 26, 2024
74 of 80 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/api-change Change to remote API; merits scrutiny lgtm Indicates that a PR is ready to be merged. release-note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants